Using HIP on Setonix¶

Access to Setonix¶

Firstly you need a username and password to access Setonix. Your username and password will be given to you prior to the beginning of this workshop. If you are using your regular Pawsey account then you can reset your password here.

Access to Setonix is via Secure Shell (SSH). On Linux, macOS, and Windows 10 and higher, an SSH client is available from the command line or terminal application. Otherwise you need to use a client program like PuTTY or MobaXterm.

Access with SSH on the command line¶

On the command line use the command ssh to access Setonix.

ssh -Y <username>@setonix.pawsey.org.au

Passwordless login with SSH¶

To avoid typing a username and password on each login, you can generate a public and private key pair on your computer with the following command.

ssh-keygen -t rsa

Then copy the public key (the file that ends in *.pub) to your account on Setonix and append it to the authorized keys in .ssh. On your machine run this command:

scp <filename>.pub <username>@setonix.pawsey.org.au:

Then log in to Setonix and run these commands:

mkdir -p ${HOME}/.ssh
cat <filename>.pub >> ${HOME}/.ssh/authorized_keys
chmod 700 ${HOME}/.ssh
chmod 600 ${HOME}/.ssh/authorized_keys
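The append step is easy to repeat by accident, which leaves duplicate keys in authorized_keys. Below is a minimal sketch of a wrapper that appends a key only once and then sets owner-only permissions (700 on the directory, 600 on the file). The function name install_pubkey and its optional second argument are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: append a public key to authorized_keys only once,
# then restrict the directory and file to the owner.
# Usage: install_pubkey <filename>.pub [ssh_dir]
install_pubkey() {
    pubkey_file="$1"
    ssh_dir="${2:-${HOME}/.ssh}"
    auth_file="${ssh_dir}/authorized_keys"

    mkdir -p "${ssh_dir}"
    touch "${auth_file}"

    # -x matches whole lines only, -F treats the key as a fixed string
    key="$(cat "${pubkey_file}")"
    if ! grep -qxF -e "${key}" "${auth_file}"; then
        printf '%s\n' "${key}" >> "${auth_file}"
    fi

    chmod 700 "${ssh_dir}"
    chmod 600 "${auth_file}"
}
```

On Setonix this would be run as install_pubkey <filename>.pub after copying the public key across.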

Finally, if you are using macOS or Linux you can add these lines to ${HOME}/.ssh/config on your computer

Host setonix
    Hostname setonix.pawsey.org.au
    IdentityFile <private_key_file>
    User <username>
    ForwardX11 yes
    ForwardAgent yes
    ServerAliveInterval 300
    ServerAliveCountMax 2
    TCPKeepAlive no

Then you can run

ssh setonix

without a password.

Access from Windows with the MobaXterm client¶

If you have an OS that is older than Windows 10 and need a client in a hurry, just download MobaXterm Home (Portable Edition) from this location. Extract the zip file and run the application. You might need to accept a firewall notification.

Now go to Settings -> SSH and uncheck "Enable graphical SSH-browser" in the SSH-browser settings pane. Also enable "SSH keepalive" to keep SSH connections active.

Figure: MobaXterm settings.

Then close the settings and start a local terminal.

Hardware environment on Setonix¶

On Setonix there are two main kinds of compute nodes:

  • CPU nodes with 2 sockets and 256 threads.
  • GPU nodes with 1 CPU socket with 128 threads and 4 MI250X GPU sockets. Each GPU socket contains two GPU compute devices.

CPU nodes¶

CPU nodes are based on the AMD™ EPYC™ 7763 processor in a dual-socket configuration. Each processor is a multi-chip design with 8 chiplets (Core Complex Dies, or CCDs). Each chiplet has 8 cores and its own 32 MB L3 cache. Every core in the chiplet has its own L1 and L2 cache and provides 2 hardware threads. That gives 16 hardware threads per chiplet, 64 cores and 128 threads per processor, and 128 cores and 256 threads per node. Here is some cache and performance information for an individual CPU.

| Node | CPU | Base clock freq (GHz) | Peak clock freq (GHz) | Cores | Hardware threads | L1 cache (KB) | L2 cache (KB) | L3 cache (MB) | FP SIMD width (bits) | Peak TFLOPs (FP32) |
|---|---|---|---|---|---|---|---|---|---|---|
| CPU | AMD EPYC 7763 | 2.45 | 3.50 | 64 | 128 | 64x32 | 64x512 | 8x32 | 256 | ~1.79 |
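The core, thread, and cache totals quoted for the CPU nodes follow directly from the per-chiplet figures. Here is a small worked sketch of that arithmetic, using only the numbers stated in the text. The variable names are illustrative.

```shell
# Per-chiplet figures quoted in the text for one EPYC 7763 processor
chiplets=8
cores_per_chiplet=8
threads_per_core=2
l3_per_chiplet_mb=32
sockets_per_node=2

# Totals per processor
cores_per_cpu=$((chiplets * cores_per_chiplet))
threads_per_cpu=$((cores_per_cpu * threads_per_core))
l3_per_cpu_mb=$((chiplets * l3_per_chiplet_mb))

# Totals per dual-socket node
cores_per_node=$((cores_per_cpu * sockets_per_node))
threads_per_node=$((threads_per_cpu * sockets_per_node))

echo "per processor: ${cores_per_cpu} cores, ${threads_per_cpu} threads, ${l3_per_cpu_mb} MB L3"
echo "per node: ${cores_per_node} cores, ${threads_per_node} threads"
```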

Below is an image of a CPU compute blade on Setonix. In this shot there are 8 CPU heatsinks, for a total of four nodes per blade.

A CPU blade on Setonix, showing four compute nodes per blade. Each compute node has two CPU sockets.

GPU nodes¶

GPU nodes on Setonix have one AMD 7A53 'Trento' CPU and four MI250X GPU processors. The CPU is a specially optimized version of the EPYC processor used in the CPU nodes, but otherwise has the same design and architecture. The Instinct™ MI250X processor is also a Multi-Chip Module (MCM) design, with two graphics dies (otherwise known as Graphics Compute Dies, or GCDs) per processor, as shown below.

AMD Instinct™ MI250X compute architecture, showing two GPU devices per processor. Image credit: AMD Instinct™ MI200 Series Accelerator and Node Architectures | Hot Chips 34

Each of the two dies (GCDs) in an MI250X appears to HIP as an individual compute device with its own 64 GB of global memory and 8 MB of L2 cache. Since there are four MI250X processors, a total of 8 GPU compute devices are visible to HIP per GPU node. Each of the 8 compute devices has 110 compute units, and every compute unit executes instructions over a bank of 4x16 floating-point SIMD units that share a 16 KB L1 cache, as seen below:

Close-up of an AMD Instinct MI250X compute unit.

The interesting thing to note about these compute units is that 64-bit and 32-bit floating-point instructions are executed natively at the same rate. The only performance cost of 64-bit arithmetic is therefore the extra bandwidth needed to move 64-bit numbers around. Below is a table of performance numbers for each of the four MI250X processors in a node.

| Card | Boost clock (GHz) | Compute units | FP32 processing elements | FP64 processing elements (equivalent compute capacity) | L1 cache (KB) | L2 cache (MB) | Device memory (GB) | Peak TFLOPs (FP32) | Peak TFLOPs (FP64) |
|---|---|---|---|---|---|---|---|---|---|
| AMD Instinct MI250X | 1.7 | 2x110 | 2x7040 | 2x7040 | 2x110x16 | 2x8 | 2x64 | 47.9 | 47.9 |
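The peak figure in the table can be reproduced from the other columns. The sketch below assumes that each compute unit contributes 4x16 = 64 processing elements and that a fused multiply-add counts as two floating-point operations per clock. Both are assumptions made for this illustration, and the function name is hypothetical.

```shell
# Hypothetical helper: reproduce the MI250X table entries from first principles.
mi250x_peak_tflops() {
    awk 'BEGIN {
        cards = 4           # MI250X processors per node
        gcds = 2            # compute devices (GCDs) per MI250X
        cus = 110           # compute units per GCD
        lanes = 64          # 4x16 SIMD lanes per compute unit
        flops_per_fma = 2   # assumption: one multiply plus one add per clock
        clock_ghz = 1.7     # boost clock from the table

        pes = gcds * cus * lanes
        tflops = pes * flops_per_fma * clock_ghz / 1000

        printf "devices per node: %d\n", cards * gcds
        printf "processing elements per MI250X: %d\n", pes
        printf "peak TFLOPs per MI250X: %.1f\n", tflops
    }'
}

mi250x_peak_tflops
```

Because FP32 and FP64 instructions run at the same rate, the same calculation applies to both precisions.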

Below is an installation image of a GPU compute blade with two nodes. Each node has 1 CPU socket and four GPU sockets.

A GPU blade on Setonix, showing two GPU nodes, each node has one CPU socket and four GPU sockets.

Job queues¶

On Setonix the following queues are available for general use:

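| Queue | Max time limit | Processing elements (CPU) | Sockets | Cores per socket | Processing elements per CPU core | Available memory (GB) | Number of HIP devices | Memory per HIP device (GB) |
|---|---|---|---|---|---|---|---|---|
| work | 24 hours | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| long | 96 hours | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| debug | 1 hour | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| highmem | 24 hours | 256 | 2 | 64 | 2 | 980 | 0 | 0 |
| copy | 24 hours | 32 | 1 | 64 | 2 | 118 | 0 | 0 |
| gpu | 24 hours | 256 | 1 | 64 | 2 | 230 | 4x2 | 64 |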

Interactive jobs on GPU nodes¶

When compiling software or running test jobs with GPUs it is helpful to have access to a "live" node. Jobs on the gpu queue of Setonix are charged to a separate allocation. At present this is the account name followed by -gpu. The following command sets up an interactive job on the gpu queue, which you can use to compile software and run interactive jobs on a GPU node.

salloc --account ${PAWSEY_PROJECT}-gpu --ntasks 1 --mem 4GB --cpus-per-task 1 --time 1:00:00 --gpus-per-task 1 --partition gpu
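If you request interactive sessions often, it can help to build the command from a few parameters and inspect it before running it. The following is a sketch only: gpu_salloc_cmd is a hypothetical helper, myproject is a placeholder, and the defaults simply mirror the command above.

```shell
# Hypothetical helper: print the salloc command for an interactive GPU session.
# Arguments: project name, wall time (default 1:00:00), GPUs per task (default 1).
gpu_salloc_cmd() {
    project="$1"
    walltime="${2:-1:00:00}"
    gpus="${3:-1}"
    # The -gpu suffix selects the separate GPU allocation
    echo "salloc --account ${project}-gpu --ntasks 1 --mem 4GB" \
         "--cpus-per-task 1 --time ${walltime}" \
         "--gpus-per-task ${gpus} --partition gpu"
}

# Print the command; "myproject" is a placeholder used when PAWSEY_PROJECT is unset
gpu_salloc_cmd "${PAWSEY_PROJECT:-myproject}"
```

Once the printed command looks right, copy it to the prompt to start the session.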

Building software for Setonix¶

The main complexity in building HIP-enabled applications on Setonix arises when you also need MPI support. Otherwise you can load the rocm module and simply use hipcc. Here are some extra modules to load if you also need MPI support.

Software modules¶

There are three main programming environments available on Setonix. Each provides C/C++ and Fortran compilers that build software with knowledge of the MPI libraries available on Setonix. The PrgEnv-gnu programming environment uses the GNU compilers, PrgEnv-aocc uses the AMD AOCC optimising compiler to try to get the best performance from the AMD CPUs on Setonix, and PrgEnv-cray uses the compilers from Cray. Use these commands to find which module to load.

| Programming environment | Command to use |
|---|---|
| AMD | module avail PrgEnv-aocc |
| Cray | module avail PrgEnv-cray |
| GNU | module avail PrgEnv-gnu |

When compiling HIP sources you have the choice of either the ROCm hipcc compiler wrapper or the Cray compiler wrapper CC from PrgEnv-cray. If you use the Cray compiler wrapper you need to swap to the PrgEnv-cray module, because the GNU programming environment (PrgEnv-gnu) is loaded by default.

module swap PrgEnv-gnu PrgEnv-cray

Then the following compiler wrappers are available for use to compile source files:

| Command | Explanation |
|---|---|
| cc | C compiler |
| CC | C++ compiler |
| ftn | Fortran compiler |

In order to use the GPU-aware MPI library from Cray you also need to load the craype-accel-amd-gfx90a module, which works in all three programming environments. To see which version to load run this command.

module avail craype-accel-amd-gfx90a

Load the module craype-accel-amd-gfx90a then set the environment variable

export MPICH_GPU_SUPPORT_ENABLED=1

Finally, in order to have ROCm software (such as hipcc and rocgdb) and libraries available you need to have the rocm module loaded. To see which one to load, run this command:

module avail rocm

The rocm module is independent of the programming environment module loaded.
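Before compiling, it can be useful to confirm that the modules discussed above are actually loaded. The sketch below assumes the module system maintains the colon-separated LOADEDMODULES environment variable, with entries of the form name/version. The function name missing_modules is hypothetical.

```shell
# Hypothetical helper: print the names of requested modules that are not loaded.
# Relies on LOADEDMODULES, a colon-separated list such as "a/1.0:b/2.1".
missing_modules() {
    for wanted in "$@"; do
        case ":${LOADEDMODULES:-}:" in
            # Loaded either as a bare name or as name/version
            *":${wanted}:"* | *":${wanted}/"*) ;;
            *) echo "${wanted}" ;;
        esac
    done
}

# Lists whichever of the three modules still need to be loaded
missing_modules PrgEnv-cray craype-accel-amd-gfx90a rocm
```

An empty result means all three are present. Otherwise load or swap in the modules it prints.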

Compiling software with HIP and MPI support¶

According to this documentation, the AMD compiler wrapper hipcc can be used for compiling HIP source files and is the suggested linker for program objects. To reduce the chance of compiler issues it is good practice to compile on a GPU node, from either a batch or an interactive job.

Compiling and linking with the hipcc compiler wrapper¶

You can use these compiler flags to bring in the MPI headers and make sure hipcc compiles kernels for the MI250X GPU's on Setonix.

| Function | Flags |
|---|---|
| Compile | -I${MPICH_DIR}/include --offload-arch=gfx90a |
| Link | -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a} |
| Debug (compile and link) | -g -ggdb |
| OpenMP (compile and link) | -fopenmp |
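One way to keep these flags consistent across a build script is to assemble them in a single place. The sketch below does that under stated assumptions: hip_flags is a hypothetical helper, and the DEBUG and OPENMP switches are illustrative variables of this example, not settings recognised by hipcc or the Cray environment.

```shell
# Hypothetical helper: print the hipcc flags from the table for one build stage.
# Usage: hip_flags compile   or   hip_flags link
hip_flags() {
    case "$1" in
        compile)
            flags="-I${MPICH_DIR:-}/include --offload-arch=gfx90a" ;;
        link)
            flags="-L${MPICH_DIR:-}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a:-} ${PE_MPICH_GTL_LIBS_amd_gfx90a:-}" ;;
        *)
            echo "usage: hip_flags compile|link" >&2
            return 1 ;;
    esac
    # Debug and OpenMP flags apply to both compiling and linking
    if [ "${DEBUG:-0}" = "1" ]; then
        flags="${flags} -g -ggdb"
    fi
    if [ "${OPENMP:-0}" = "1" ]; then
        flags="${flags} -fopenmp"
    fi
    echo "${flags}"
}
```

A compile line would then read hipcc $(hip_flags compile) -c <source>.cpp, and a link line hipcc <objects> $(hip_flags link).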

If you want hipcc to behave like Cray CC, make sure the PrgEnv-cray and craype-accel-amd-gfx90a modules are also loaded. Then you can add the output of this command,

$(CC --cray-print-opts=cflags)

to the hipcc compile flags, and the output of this command,

$(CC --cray-print-opts=libs)

to the hipcc linker flags.

Compiling and linking with the Cray CC compiler wrapper¶

If you are using the Cray compiler wrapper CC you can add these flags to compile and link HIP code for the MI250X GPU's on Setonix. You need to have the rocm module loaded.

| Function | Flags |
|---|---|
| Compile | -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -x hip |
| Link | (no extra flags) |
| Debug (compile and link) | -g |
| OpenMP (compile and link) | -fopenmp |

Mixing hipcc and Cray compilation¶

According to this documentation, it is important to ensure that all code links against the same C++ standard library. The command hipconfig --cxx generates extra compile flags that might be useful to include in the build process with the Cray wrapper.

Exercise: compile and run your first MPI-enabled HIP application¶

The files hello_devices_mpi.cpp and hello_devices_mpi_onefile.cpp implement an MPI-enabled HIP application that reports on devices and fills a vector. The difference between the two is that hello_devices_mpi.cpp has its kernel located in a separate file, kernels.hip.cpp. Your task is to compile these files into two executables, hello_devices_mpi.exe and hello_devices_mpi_onefile.exe.

Compilation steps¶

  1. Log into setonix.pawsey.org.au.
  2. Use cd to change directory to your temporary file location in /scratch.
  3. Clone the course material from GitHub if you don't already have it.
    git clone git@github.com:pelagos-consulting/HIP_Course.git
  4. Change directory to course_material/L2_Using_HIP_On_Setonix.
  5. Get an interactive job on the GPU queue of Setonix with this command:
    salloc --account ${PAWSEY_PROJECT}-gpu --ntasks 1 --mem 4GB --cpus-per-task 1 --time 1:00:00 --gpus-per-task 1 --partition gpu
  6. Load the rocm module
    module load rocm
  7. Swap out the PrgEnv-gnu module for the PrgEnv-cray module
    module swap PrgEnv-gnu PrgEnv-cray
  8. Load the craype-accel-amd-gfx90a module
    module load craype-accel-amd-gfx90a

Compile the kernel and main program in separate files¶

  1. Compile the kernel file kernels.hip.cpp
    hipcc -c kernels.hip.cpp --offload-arch=gfx90a -o kernels.o
  2. Use CC to compile the file hello_devices_mpi.cpp. Make sure to include the location of the hip_helper.hpp header, located in ../include.
    CC -c -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -I../include -x hip hello_devices_mpi.cpp -o hello_devices_mpi.o
  3. Use hipcc to link the object files together in a way that is aware of the MPI library.
    hipcc kernels.o hello_devices_mpi.o -o hello_devices_mpi.exe -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}
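The three steps above can be collected into a small script that prints each command before running it, which makes it easier to see which step fails. This is a sketch under assumptions: the run wrapper, the DRY_RUN switch, and the function name build_hello_devices_mpi are all hypothetical names introduced here.

```shell
# Hypothetical wrapper: print each command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Compile the kernel with hipcc, the main program with CC, then link with hipcc.
build_hello_devices_mpi() {
    run hipcc -c kernels.hip.cpp --offload-arch=gfx90a -o kernels.o || return 1
    run CC -c -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a \
        -I../include -x hip hello_devices_mpi.cpp -o hello_devices_mpi.o || return 1
    # The GTL variables are unquoted on purpose so they split into separate flags
    run hipcc kernels.o hello_devices_mpi.o -o hello_devices_mpi.exe \
        -L"${MPICH_DIR:-}/lib" -lmpi \
        ${PE_MPICH_GTL_DIR_amd_gfx90a:-} ${PE_MPICH_GTL_LIBS_amd_gfx90a:-}
}
```

Setting DRY_RUN=1 before calling build_hello_devices_mpi prints the three commands without invoking any compiler.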

Compile the combined file in one go using hipcc¶

hipcc -I${MPICH_DIR}/include -I../include --offload-arch=gfx90a hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_hipcc.exe -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}

Compile the combined file in one go using CC¶

CC -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -I../include -x hip hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_CC.exe

Run the compiled code¶

Using srun we can run the executable files. Without srun, a program picks up all the GPUs on a node rather than only those allocated to the task.

  1. srun ./hello_devices_mpi.exe
  2. srun ./hello_devices_mpi_onefile_hipcc.exe
  3. srun ./hello_devices_mpi_onefile_CC.exe
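To see why the results differ with and without srun, it helps to check which devices a process is allowed to use. The sketch below assumes that Slurm restricts GPU visibility through the ROCR_VISIBLE_DEVICES environment variable. Whether Setonix is configured exactly this way is an assumption, and visible_gpu_count is a hypothetical helper.

```shell
# Hypothetical helper: count the devices listed in ROCR_VISIBLE_DEVICES.
# Prints "all" when the variable is unset, meaning no restriction is in place.
visible_gpu_count() {
    if [ -z "${ROCR_VISIBLE_DEVICES+set}" ]; then
        echo "all"
        return 0
    fi
    if [ -z "${ROCR_VISIBLE_DEVICES}" ]; then
        echo 0
        return 0
    fi
    # Count the comma-separated entries
    echo "${ROCR_VISIBLE_DEVICES}" | awk -F',' '{ print NF }'
}
```

Comparing the value of ROCR_VISIBLE_DEVICES printed inside and outside an srun step should show the difference in visible devices.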

Bonus task: Try running these programs with and without srun to see what happens.¶

The answer¶

If you get stuck, the example Makefile contains the above compilation steps. Assuming you loaded the right modules defined above, the make command is run as follows:

make clean; make

The script run_compile.sh contains the necessary commands to load the appropriate modules and run the make command.

chmod 700 run_compile.sh
./run_compile.sh

Tips for batch jobs with HIP on GPU nodes¶

Pawsey has extensive documentation available for running jobs, at this site. Here is some information that is specific to making best use of the GPU nodes on Setonix.

GPU node configuration¶

On the GPU nodes of Setonix there is 1 CPU and 8 compute devices. Each of the 8 chiplets in the CPU is intended to have optimal access to one of the 8 available GPU compute devices. Shown below is a hardware diagram of a compute node, where each chiplet has optimal access to one compute device.

Overall view of a Setonix GPU node, showing the placement of hardware threads and the closest available compute device.

Work is still being done on making sure that MPI processes map optimally to available compute devices. These suggestions will help space out the MPI tasks so each task resides on its own chiplet.

  • Use --ntasks-per-node=8 to allocate up to 8 MPI tasks per node, one per compute device.
  • Use --gpus-per-task=1 to allocate 1 compute device per MPI task.
  • Use --cpus-per-task=8 and --threads-per-core=1 to allocate all 8 cores in a chiplet, with one hardware thread per core, to a single MPI process.
  • Use the --gpu-bind=closest option to bind each compute device to the closest MPI task.
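With these settings the 8 tasks together cover the 64 cores of the CPU. The sketch below illustrates the intended layout under a simplifying assumption: cores are numbered contiguously so that chiplet i holds cores 8i to 8i+7. The real core numbering and the pairing of chiplets with compute devices should be checked on the node itself. The function name is hypothetical.

```shell
# Settings from the suggestions above
tasks_per_node=8
cpus_per_task=8

# Hypothetical helper: core range for one task, assuming contiguous numbering
task_core_range() {
    t="$1"
    first=$((t * cpus_per_task))
    last=$((first + cpus_per_task - 1))
    echo "task ${t}: cores ${first}-${last}"
}

# Print the layout for every task on the node
i=0
while [ "${i}" -lt "${tasks_per_node}" ]; do
    task_core_range "${i}"
    i=$((i + 1))
done
```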

Example job script¶

The suggested job script below allocates an MPI task for every compute device on a node of Setonix, then allocates 8 OpenMP threads to each MPI task. We can use the helper program hello_jobstep.cpp, adapted from a program by Thomas Papatheodore from ORNL. Every software thread executed by the program reports the MPI rank, the OpenMP thread, the CPU hardware thread, and the GPU and bus IDs of the GPU hardware.

#!/bin/bash -l

#SBATCH --account=<account>-gpu    # your account
#SBATCH --partition=gpu            # Using the gpu partition
#SBATCH --ntasks=8                 # Total number of tasks
#SBATCH --ntasks-per-node=8        # Set this for 1 mpi task per compute device
#SBATCH --cpus-per-task=8          # How many OpenMP threads per MPI task
#SBATCH --threads-per-core=1       # How many OpenMP threads per core
#SBATCH --gpus-per-task=1          # How many HIP compute devices to allocate to a  task
#SBATCH --gpu-bind=closest         # Bind each MPI task to the nearest GPU
#SBATCH --mem=4000M                # Indicate the amount of memory per node when asking for shared resources
#SBATCH --time=01:00:00

module swap PrgEnv-gnu PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   #To define the number of OpenMP threads available per MPI task, in this case it will be 8
export OMP_PLACES=cores     #To bind to cores 
export OMP_PROC_BIND=close  #To bind (fix) threads (allocating them as close as possible). This option works together with the "places" indicated above, then: allocates threads in closest cores.

# Temporary workaround for avoiding Slingshot issues on shared nodes:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

# Run a job with task placement and $BIND_OPTIONS
#srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS $BIND_OPTIONS  ./hello_jobstep.exe
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./hello_jobstep.exe | sort

The file jobscript.sh contains the batch script above. Edit the <account> information to name the account to charge, then submit the script with

sbatch jobscript.sh

Have a look at the .out file and examine how the threads and GPUs are placed.
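Since the job script does not set an output file, Slurm writes to its default name, slurm-<jobid>.out. The sketch below derives that name from the output of sbatch --parsable, which prints the job ID, optionally followed by a semicolon and the cluster name. The function name slurm_out_file is hypothetical.

```shell
# Hypothetical helper: map the output of "sbatch --parsable" to the default
# output file name. The input is "jobid" or "jobid;cluster".
slurm_out_file() {
    jobid="${1%%;*}"    # drop the optional ";cluster" suffix
    echo "slurm-${jobid}.out"
}

# Intended use on Setonix (not executed here):
#   submitted=$(sbatch --parsable jobscript.sh)
#   less "$(slurm_out_file "${submitted}")"
```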